Abstract: One issue of real interest in web usage mining is capturing users’ activities during their connection and extracting behavior patterns that help define their preferences, in order to improve the design of future pages by adapting website interfaces to individual users. This research first presents the methodological foundations of the use of probabilistic languages to identify the most relevant or most visited websites. Second, web sessions are represented by graphs and probabilistic context-free grammars, so that the sessions with the highest probabilities are considered the most visited and most preferred and, therefore, the most relevant to a particular topic. The aim is to develop a tool for processing web sessions obtained from a server log and represented by probabilistic context-free grammars.

Keywords: Probabilistic Grammars, Navigation Patterns, Pattern Learning, Hypertext Probabilistic Grammar, Hypertext, Information Retrieval.


I. Introduction

What science and technology have achieved so far is truly spectacular; we only have to look around to see what our understanding of nature has helped us achieve. The first text mining efforts were made in the early eighties and required a great deal of human effort, but technological advances have allowed the area to progress remarkably in the last decade. Text mining is a multidisciplinary area based on information retrieval, data mining, machine learning, statistics and computational linguistics. Since most information (over 80%) is currently stored as text, text mining is believed to have great commercial value. When users browse the Web to retrieve pages related to a particular concept, they have to sift through many irrelevant pages; the objective is therefore to retrieve the significant pages, that is, those that are authorities on the subject.

There are two related concepts: the most relevant pages and the most visited pages. We start from the premise that the most relevant pages are those that are most visited. From the information contained in the server logs, this research captures users’ activities during their connection to the Web and extracts behavioral patterns that help to understand users’ browsing preferences, so that the interfaces of future pages can be adapted to individual users. To this end, a simple hypertext model represented by graphs was used; that is, the users’ navigation sessions inferred from the log files were represented as a hypertext probabilistic grammar.


Automated information retrieval systems require speed, consistency, accuracy and ease of use in retrieving relevant texts to satisfy users’ queries.

B. Web Mining 

There is a growing need to know how users interact with websites. Web mining (WM) is essentially concerned with discovering and analyzing users’ information on the Web in order to uncover behavior patterns. Alcivar uses the term WM for the technology used to discover non-obvious information from data sources that include server logs [2].

C. Formal Language 

Although a natural language is governed by grammatical rules that are already defined, these rules can be modified later (see Fig. 1). This is an advantage for natural language because it enriches the language, yet at the same time it hinders computer processing, since natural language can be ambiguous and imprecise. By contrast, a formal language is unambiguous and exact; it is a language developed by people to express situations that occur specifically in each area of scientific knowledge. Formal languages can be used to model a theory in mechanics, physics, mathematics, electrical engineering or other fields, with the advantage that all ambiguities are eliminated. Of particular importance are computer programming languages, which are defined by a set of lexical components, grammatical rules and semantic constraints [3], [4].


II. Objectives 

A. General Objective: To obtain a tool to identify the 
preferences of users on the Web. 

B. Specific Objectives: 

1. To represent web sessions by directed graphs.

2. To represent web sessions using hypertext probabilistic context-free grammars.

III. Conceptual Framework 

A. Information Retrieval 

Information retrieval (IR) is a term used in a very broad sense and is often vaguely defined, so it requires precision; in this context it refers only to automated information retrieval systems. Contreras points out in her thesis [1] that:

“Lancaster provides a definition: ‘An information retrieval system does not inform (i.e., change the knowledge of) the users on the subject of their inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to their request.’”





Fig. 1. Grammar and language
Source: Author 

1. Definition of Alphabet A: An alphabet A is a finite set of symbols. The elements of an alphabet are the basic units or primitives of a language; these, in turn, are grouped into strings [5], [6].

2. Definition of Word: A string or word over an alphabet A is a finite sequence of elements of A [7].

D. Grammar 

A grammar G is a linguistic and mathematical model that describes the syntactic order that the well-formed sentences of a language must follow [8], [9]. A grammar is formally defined as in (1):

$G = (V_T, V_N, P, S)$ (1)

Where:

$V_T$: finite set of terminal symbols of the language
$V_N$: finite set of non-terminal symbols
$P$: finite set of production rules
$S \in V_N$: distinguished symbol or initial axiom

Starting from axiom $S$, the sequences of the language $L$ are recognized by successively applying the production rules of the grammar.
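As a minimal sketch, assuming Python and a hypothetical toy grammar that is not part of the original study, the four-tuple in (1) could be held in a small data structure and used to derive a string from axiom S. The names `Grammar`, `derive` and `TOY` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # V_T
    nonterminals: frozenset   # V_N
    productions: tuple        # P, as (left-hand side, right-hand side tuple) pairs
    start: str                # S

    def is_well_formed(self) -> bool:
        """Check that S is in V_N, that V_T and V_N are disjoint,
        and that every rule only uses known symbols."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(sym in symbols for sym in rhs)
                for lhs, rhs in self.productions
            )
        )


def derive(grammar, rule_indices):
    """Apply the listed productions in order, always rewriting the leftmost non-terminal."""
    form = [grammar.start]
    for index in rule_indices:
        lhs, rhs = grammar.productions[index]
        position = next(i for i, sym in enumerate(form) if sym in grammar.nonterminals)
        if form[position] != lhs:
            raise ValueError(f"rule {index} does not apply to {form[position]}")
        form[position:position + 1] = list(rhs)
    return form


# Hypothetical toy grammar (an assumption for illustration): two pages a1 and a2.
TOY = Grammar(
    terminals=frozenset({"a1", "a2"}),
    nonterminals=frozenset({"S", "A1", "A2"}),
    productions=(
        ("S", ("a1", "A1")),    # rule 0: a session starts on page a1
        ("A1", ("a2", "A2")),   # rule 1: link from page 1 to page 2
        ("A2", ()),             # rule 2: the session ends
    ),
    start="S",
)

print(derive(TOY, [0, 1, 2]))
```

Applying rules 0, 1 and 2 in turn rewrites S into the terminal string a1 a2, which is one sentence of the toy language.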


The term hypertext refers to a system for organizing and presenting data based on linking fragments of text or graphics to other fragments, allowing the user to access information not necessarily sequentially but from any of several related items, as shown in Fig. 3.


E. Probabilistic Context-Free Grammar 

Chomsky classified grammars according to the form of their production rules; a context-free grammar has rules of the form:

$P: A \rightarrow \alpha$

Where:

$A \in V_N$ and $\alpha \in (V_N \cup V_T)^*$

The left side contains only a non-terminal, while the right side consists of a sequence of terminals and non-terminals [3], [8].

A probabilistic context-free grammar (PCFG) is a context-free grammar in which each rule is assigned a probability. The probability of a parse is the product of the probabilities of the rules used in it, so some analyses are more probable than others. A PCFG thus extends a context-free grammar with a probability function [2], [10].

A PCFG is then defined as a five-tuple $G = (V_T, V_N, P, S, \pounds)$, where $\pounds$ is a function that assigns a probability to each rule in $P$. The function $\pounds$ expresses the probability $p$ that a given non-terminal $A$ will be expanded to the sequence $\beta$. Each rule in $P$ therefore carries a conditional probability:

$A \rightarrow \beta \; [p]$

1. Assign Probabilities to Every Production Rule: After defining the grammar, a probability is assigned to each production rule. Consider the example taken from [3] (see Fig. 2).



Fig. 3. Hypertext 
Source: Author 


F. Hypertext Navigation 

To understand more clearly the nature of navigation through the information hyperspace, it is necessary to decompose the problem, as several authors have tried to do. In this regard, [2] and [11] note a distinction in the classification made by Wright and Lickorish: internal navigation is that which is part of the hypertext itself, while external navigation is that made possible by generic navigation tools that are independent of the hypertext. Hypertext navigation refers to the process of moving through multiple pages when visiting the Web.

G. Hypertext Probabilistic Grammar 


Rule                            Probability   English gloss
S → NOMBRE VERBO                [1.0]         NAME VERB
NOMBRE → ADJ NOMBRE             [0.4]         NAME → ADJ NAME
NOMBRE → ADJ NOMB-SING          [0.6]         NAME → ADJ NAME-SING
VERBO → VERB-SING ADVERBIO      [1.0]         VERB → VERB-SING ADVERB
ADJ → El                        [0.25]        the
ADJ → La                        [0.25]        the
ADJ → Los                       [0.15]        the
ADJ → Las                       [0.15]        the
ADJ → Esos                      [0.10]        those
ADJ → Pequeño/traviesa          [0.10]        small/naughty
NOMB-SING → niño                [0.50]        boy
NOMB-SING → niña                [0.50]        girl
VERB-SING → estudia             [0.27]        studies
VERB-SING → corre               [0.16]        runs
VERB-SING → juega               [0.34]        plays
VERB-SING → salta               [0.23]        jumps
ADVERBIO → rápidamente          [0.45]        fast
ADVERBIO → despacio             [0.28]        slowly
ADVERBIO → mucho                [0.27]        much

Fig. 2. Grammar with probabilities
Source: [3]
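To illustrate how the probability of a parse is obtained as a product of rule probabilities, the following sketch (an illustration in Python, not the authors' Java tool) scores one derivation using a subset of the rules of Fig. 2. The sentence "La niña juega mucho" and the names `RULES` and `derivation_probability` are assumptions chosen for the example.

```python
from math import prod

# Subset of the rules in Fig. 2, keyed by (left-hand side, right-hand side).
RULES = {
    ("S", ("NOMBRE", "VERBO")): 1.0,
    ("NOMBRE", ("ADJ", "NOMB-SING")): 0.6,
    ("VERBO", ("VERB-SING", "ADVERBIO")): 1.0,
    ("ADJ", ("La",)): 0.25,
    ("NOMB-SING", ("niña",)): 0.50,
    ("VERB-SING", ("juega",)): 0.34,
    ("ADVERBIO", ("mucho",)): 0.27,
}


def derivation_probability(derivation, rules=RULES):
    """Multiply the probabilities of every rule used in the derivation."""
    return prod(rules[rule] for rule in derivation)


# Leftmost derivation of the example sentence "La niña juega mucho".
LA_NINA_JUEGA_MUCHO = [
    ("S", ("NOMBRE", "VERBO")),
    ("NOMBRE", ("ADJ", "NOMB-SING")),
    ("ADJ", ("La",)),
    ("NOMB-SING", ("niña",)),
    ("VERBO", ("VERB-SING", "ADVERBIO")),
    ("VERB-SING", ("juega",)),
    ("ADVERBIO", ("mucho",)),
]

print(derivation_probability(LA_NINA_JUEGA_MUCHO))
```

The product is 1.0 × 0.6 × 0.25 × 0.50 × 1.0 × 0.34 × 0.27 ≈ 0.0069; a sentence built with "corre" instead of "juega" would score lower because that rule has probability 0.16.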


A hypertext probabilistic grammar (HPG) is defined as $G = (V_T, V_N, P, S, \pounds)$; it is a regular grammar, describable by a regular expression, with a one-to-one relationship between $V_N$ and $V_T$.

Hernandez noted [2] that the users’ navigation sessions inferred from the log files can be represented as a hypertext probabilistic grammar. Each non-terminal symbol of $G$ corresponds to a visited page, and each derivation rule corresponds to a link between pages. Thus, the rule from $A$ to $B$ means the transition from page $A$ to page $B$. The key idea of this method is that the strings generated by the grammar with the highest probability correspond to the users’ preferred paths [12].

The probability of a string of the grammar is the product of the probabilities of the productions used in its derivation [11].
H. Web Server Logs

Server logs essentially consist of one or more text files that are automatically created and managed by a server and that store all the activity carried out on it. Depending on its implementation and configuration, each server may or may not create a particular log. One of the most typical logs is the access log of a web server, which records, for each access, data such as the IP address, browser, date and time, allowing website statistics to be produced [2], [13].

IV. Methodology 

The research was conducted with a sample of server log files from the computer lab of the Systems Engineering Faculty. A hypertext grammar (HG) was built from these files. Each non-terminal symbol of the HG corresponds to a page and each derivation rule to a transition from one page to another. The number of times each grammatical rule was applied was determined, and the frequency with which pages appear in the navigation sessions was estimated; the probabilities of the production rules were then assigned. To model the navigation sessions, a graph was constructed. Finally, a Java program was developed using the NetBeans IDE 7.3 platform.

A. Grammar Definition 

Grammar G was defined by identifying the terminal symbols, non-terminal symbols and derivation rules. A non-terminal symbol was assigned to each identified page.

B. Definition of Grammar HPG

The probability of each production rule of the grammar was calculated.

C. Definition of Navigation Sessions 

Using the server logs, a set P containing the navi- 
gation sessions was constructed. 

D. Session Graph Construction 

Sessions were modeled by a graph structure G. 
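A session graph of this kind could be sketched as follows, assuming each session is simply a list of page names. The sample sessions and the function name `build_session_graph` are hypothetical and only illustrate the counting of visits, session starts, session ends and traversed links.

```python
from collections import Counter


def build_session_graph(sessions):
    """Return (visits, starts, ends, links) counters for a list of page sequences.

    visits[page]   : how many times the page was requested
    starts[page]   : how many sessions began on the page
    ends[page]     : how many sessions finished on the page
    links[(a, b)]  : how many times the link a -> b was traversed
    """
    visits, starts, ends, links = Counter(), Counter(), Counter(), Counter()
    for session in sessions:
        if not session:
            continue  # ignore empty sessions
        visits.update(session)
        starts[session[0]] += 1
        ends[session[-1]] += 1
        links.update(zip(session, session[1:]))
    return visits, starts, ends, links


# Hypothetical sessions, not taken from the study's log files.
SAMPLE_SESSIONS = [
    ["A1", "A2", "A3"],
    ["A1", "A3"],
    ["A2", "A3"],
]

visits, starts, ends, links = build_session_graph(SAMPLE_SESSIONS)
print(links)
```

The `links` counter is the weighted edge set of the directed graph; dividing each edge weight by the number of visits to its source page would give the labels on the graph's arcs.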

E. Implementation 

A prototype was constructed to identify the most 
relevant pages. 



V. Results 

A. Definition of Hypertext Probabilistic Grammar 

Using the set P of navigation sessions obtained from the server log files, the identified pages were represented by non-terminal symbols of G.

The production rules are displayed in Fig. 4, where each line is labeled with the probability $P_{ij}$ of the derivation from $A_i$ to $A_j$.



The next step was to perform the statistical calculations needed to assign probabilities (see Table I). After determining the number of times pages were linked, the mean and conditional probabilities were calculated, together with the number of times each grammar rule was applied.


Table I. Determination of Probabilities

Rule          Occurrences of        Occurrences    Probability
              left-hand symbol      of rule
S → a1A1      100                   12             0.12
S → a2A2      100                   3              0.03
S → a3A3      100                   8              0.08
S → a4A4      100                   9              0.09
S → a5A5      100                   25             0.25
S → a6A6      100                   33             0.33
S → a7A7      100                   10             0.10
...
A6 → a2A7     50                    16             0.32
A6 → F        50                    34             0.68
A7 → F        15                    15             1.00

Source: Author
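The probabilities in Table I follow from dividing the number of times a rule was applied by the number of times its left-hand symbol was expanded. A minimal sketch of that calculation, using the start-production counts from Table I (the function name `rule_probabilities` is an assumption), could look like this:

```python
def rule_probabilities(rule_counts):
    """Map each (lhs, rhs) rule to count(rule) / total count of rules sharing its lhs."""
    totals = {}
    for (lhs, _rhs), count in rule_counts.items():
        totals[lhs] = totals.get(lhs, 0) + count
    return {rule: count / totals[rule[0]] for rule, count in rule_counts.items()}


# Occurrences of the start productions as listed in Table I.
START_COUNTS = {
    ("S", "a1A1"): 12,
    ("S", "a2A2"): 3,
    ("S", "a3A3"): 8,
    ("S", "a4A4"): 9,
    ("S", "a5A5"): 25,
    ("S", "a6A6"): 33,
    ("S", "a7A7"): 10,
}

for rule, probability in rule_probabilities(START_COUNTS).items():
    print(rule, round(probability, 2))
```

Because the seven counts add up to 100, the probabilities of the start productions sum to 1, as required for the rules that share a left-hand symbol.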


Grammar G was then extended to an HPG. The productions are of two types:

1. Start productions: those that begin with the axiom (S) and represent the start of a session.

2. Transitive productions: those that start with a non-terminal other than S and correspond to the links between pages [2].

Table II shows the grammar with its probabilities:
Table II. Grammar with Probabilities

 1) S → a1A1 (0.12)       14) A2 → a5A7 (0.32)
 2) S → a2A2 (0.03)       15) A4 → a5A5 (0.26)
 3) S → a3A3 (0.08)       16) A3 → a2A4 (0.63)
 4) S → a4A4 (0.09)       17) A3 → a5A6 (0.37)
 5) S → a5A5 (0.25)       18) A5 → a3A6 (0.23)
 6) S → a6A6 (0.33)       19) A5 → a2A1 (0.30)
 7) S → a7A7 (0.10)       20) A6 → a2A7 (0.32)
 8) A1 → a2A3 (0.35)      20) A1 → F (0.30)
 9) A1 → a4A4 (0.12)      21) A4 → F (0.57)
10) A1 → a3A7 (0.23)      22) A5 → F (0.47)
11) A4 → a2A6 (0.17)      23) A6 → F (0.68)
12) A2 → a2A3 (0.23)      24) A7 → F (0.10)
13) A6 → a4A2 (0.45)

Source: Author
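Since the probability of a string is the product of the probabilities of the productions used to derive it, whole sessions can be ranked by multiplying rule probabilities along their path. The sketch below is an illustration rather than the authors' implementation: it copies a few probabilities from Table II and, as an assumption, reads each rule only as a transition between non-terminals (for example, rule 8 as a move from A1 to A3), with F marking the end of a session.

```python
# A few transitions taken from Table II; "S" is the axiom and "F" the final state.
HPG_RULES = {
    "S": {"A1": 0.12},
    "A1": {"A3": 0.35, "F": 0.30},
    "A3": {"A4": 0.63, "A6": 0.37},
    "A4": {"F": 0.57},
    "A6": {"F": 0.68},
}


def session_probability(pages, rules=HPG_RULES):
    """Probability of the session S -> pages[0] -> ... -> pages[-1] -> F."""
    path = ["S", *pages, "F"]
    probability = 1.0
    for source, target in zip(path, path[1:]):
        # A transition that is not in the grammar has probability zero.
        probability *= rules.get(source, {}).get(target, 0.0)
    return probability


candidates = [["A1", "A3", "A6"], ["A1", "A3", "A4"], ["A1"]]
for pages in sorted(candidates, key=session_probability, reverse=True):
    print(pages, round(session_probability(pages), 4))
```

Under these assumptions the one-page session A1 scores 0.12 × 0.30 = 0.036, ahead of A1, A3, A4 (about 0.0151) and A1, A3, A6 (about 0.0106), so it would be treated as the preferred path among the three.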


B. Sessions Graph 


The production rules are shown in the following graph (see Fig. 5), where the lines are labeled with the probability $P_{ij}$ of the derivation from $A_i$ to $A_j$.



Table III. Navigation Sessions

[Table with columns ID and SESSION listing eight navigation sessions, each a sequence of pages $A_i \rightarrow A_j \rightarrow \dots$; the page indices are not recoverable from the source.]

Source: Author


Where:

$S_i$: a session of set $P$
$A_i$: a page involved in a session $S_i$
$r_i$: the number of times page $A_i$ was requested in the sessions of $P$
$p_i$: the number of times page $A_i$ was the first state in a session $S_i$ of $P$
$u_i$: the number of times page $A_i$ was the last state in a session $S_i$ of $P$
$t_{ij}$: the number of times a subsequence of two pages appears in the sessions of $P$ or, equivalently, the number of times the link was traversed
$\alpha > 0$: strings can be generated from any state
$\alpha = 0$: only states that were first in the actual sessions have a probability greater than zero of being a start production
$\alpha = 1$: the probability of a start production is proportional to the number of times the corresponding state was visited; the destination node of a production with higher probability corresponds to the state that was visited more often
$N$ ($N \geq 1$): determines the user’s memory when navigating the Web, that is, the number of previous URLs that may influence the choice of the next URL

If $N = 1$, the result is what is formally known as a Markov chain, a special type of discrete stochastic process in which the probability of an event depends only on the immediately preceding event. This lack of memory is called the Markov property.

C. Determination of Sessions Probability

As already established, the productions are of two types: start productions and transitive productions.

Using the grammar strings that represent users’ navigation sessions (see Table III), a statistical calculation was made over the collection of navigation sessions. It yielded the number of times each page appears as the initial page, the number of times it appears as the final page, and the number of times it is neither the initial nor the final page. A pattern is obtained from these statistics.

With $N = 1$ and $\alpha = 0.5$, the probability of a start production is given by (2) and evaluated for page $A_1$ in (3):

$$P(S \rightarrow a_1 A_1) = \alpha \cdot \frac{\text{N-V-}A_1}{\text{T-N-V}} + (1 - \alpha) \cdot \frac{\text{N-S-}A_1}{\text{T-N-S}} \qquad (2)$$

Where:

$\text{N-V-}A_1$: number of visits to $A_1$ = 6
$\text{N-S-}A_1$: number of starts from $A_1$ = 4
$\text{T-N-V}$: total number of visits = 36
$\text{T-N-S}$: total number of starts = 8

$$P(S \rightarrow a_1 A_1) = 0.5 \cdot \frac{6}{36} + 0.5 \cdot \frac{4}{8} = 0.33 \qquad (3)$$

Using axiom $S$, any symbol between $A_1$ and $A_7$ can be chosen. Applying the formula shows that page $A_1$ has the highest probability of being selected, followed by $A_3$, $A_4$, $A_5$ and $A_6$; $A_2$ and $A_7$ are equally probable (Table IV).
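The start-production calculation can be checked with a short sketch. It assumes the weighting implied by the description of α, namely α on the share of visits and (1 − α) on the share of session starts, and uses the figures reported for page A1 (6 visits out of 36, 4 starts out of 8). The function name `start_probability` is illustrative.

```python
def start_probability(alpha, visits, total_visits, starts, total_starts):
    """Weighted mix of a page's share of all visits and its share of all session starts.

    alpha = 1 uses only the visit share; alpha = 0 uses only the start share
    (this weighting is an assumption consistent with the description of alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * visits / total_visits + (1 - alpha) * starts / total_starts


# Values reported for page A1: 6 of 36 visits, 4 of 8 session starts.
p_a1 = start_probability(alpha=0.5, visits=6, total_visits=36, starts=4, total_starts=8)
print(round(p_a1, 2))
```

With α = 0.5 the result rounds to 0.33, matching (3). Setting α = 0 would give 4/8 = 0.5, because only session starts would count, while α = 1 would give 6/36 ≈ 0.17.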

Table IV. Production Choice Statistics From Axiom S

Production    α      N-V-Ai    T-N-V    N-S-Ai    T-N-S    Probability
S → a1A1      0.5    6         36       4         8        0.33333333
S → a2A2      0.5    2         36       0         8        0.02777778
S → a3A3      0.5    7         36       2         8        0.22222222
S → a4A4      0.5    7         36       1         8        0.15972222
S → a5A5      0.5    6         36       1         8        0.14583333
S → a6A6      0.5    6         36       0         8        0.08333333
S → a7A7      0.5    2         36       0         8        0.02777778

Source: Author

These probabilities are shown in Fig. 6.

Fig. 6. Comparison of the probability of each page being selected from axiom S.
Source: Author


D. Implementation 

1. Entry and storage of the log files on the server (see Table V). A hypertext probabilistic grammar is created from the server log files.

2. Cleaning of the stored data. Irrelevant entries that do not transfer content are removed.

3. Identification of users.

4. Identification of sessions and recognition of the pages considered as requests.

Table V. Log Format

ID session    Session identifier
ID User       Identifier of the user who logs in
IP            IP address of the user who logs in
Start time    Date and time of the user’s login
End time      Date and time of the user’s logout
NPV           Number of pages accessed in the website
NS            Total number of requests made during the session
BD            Total bytes transferred during the session

Source: Author
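The cleaning, user identification and session identification steps listed above could be sketched as follows. This is an illustration only and not the authors' Java prototype: it assumes that each request is an (IP, timestamp, page) tuple, that users are identified by IP address, that image, style and script files count as irrelevant entries, and that 30 minutes of inactivity closes a session. None of these choices is stated in the source.

```python
from datetime import datetime, timedelta

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed irrelevant requests
SESSION_TIMEOUT = timedelta(minutes=30)                      # assumed inactivity limit


def clean(requests):
    """Drop requests for resources that are assumed not to carry page content."""
    return [r for r in requests if not r[2].lower().endswith(IGNORED_SUFFIXES)]


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group requests into sessions per IP, opening a new session after a long gap."""
    sessions = []
    last_seen = {}  # ip -> (timestamp of last request, page list of the open session)
    for ip, timestamp, page in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or timestamp - previous[0] > timeout:
            pages = []
            sessions.append((ip, pages))
        else:
            pages = previous[1]
        pages.append(page)
        last_seen[ip] = (timestamp, pages)
    return sessions


# Hypothetical log entries for illustration.
SAMPLE_REQUESTS = [
    ("10.0.0.1", datetime(2013, 1, 1, 10, 0), "/index.html"),
    ("10.0.0.1", datetime(2013, 1, 1, 10, 1), "/logo.png"),
    ("10.0.0.1", datetime(2013, 1, 1, 10, 5), "/courses.html"),
    ("10.0.0.2", datetime(2013, 1, 1, 10, 6), "/index.html"),
    ("10.0.0.1", datetime(2013, 1, 1, 11, 0), "/index.html"),
]

for ip, pages in split_sessions(clean(SAMPLE_REQUESTS)):
    print(ip, pages)
```

Each resulting page list is one navigation session and can be fed to the counting and probability steps described earlier.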


VI. Conclusions 

This research emphasizes the importance of context-free grammars (widely used in language theory) as a tool for detecting the preferences of website users. This instrument allows commercial companies to improve their websites and maximize business impact according to the dynamic behavior of their visitors.

From the log files, the method made it possible to infer users’ navigation sessions and represent them through a hypertext probabilistic grammar, so that the sequences generated or recognized by the grammar correspond to users’ preferred sessions or paths.

The main difficulties in building the probabilistic context-free grammar were, first, building the grammar itself and, second, assigning a probability to each production rule.

The developed model can be used to calculate the probability of reaching a page given that the user is on a particular page.

Many tools for website analysis and statistics, working together with web servers, provide good data views and summaries for generating reports and graphs, but they do not support other activities such as deriving patterns of user behavior or exploring the relevance and ranking of pages. Our analysis of web sessions modeled by context-free grammars provides the ability to extract and use information from sessions to learn users’ behavior patterns. Patterns obtained from past use can guide web customization, understood as any action that adapts the Web to suit the user.

Computational linguistics is not only a method but a paradigm, with a computational scheme of language processing that has led to a wide variety of applications; in this case, the learning of navigation patterns.